Smoking-associated gene expression alterations in nasal epithelium reveal immune impairment linked to lung cancer risk

Background: Lung cancer is the leading cause of cancer-related death in the world. In contrast to many other cancers, a direct connection to modifiable lifestyle risk in the form of tobacco smoke has long been established. More than 50% of all smoking-related lung cancers occur in former smokers, 40% of which occur more than 15 years after smoking cessation. Despite extensive research, the molecular processes for persistent lung cancer risk remain unclear. We thus set out to examine whether risk stratification in the clinic and in the general population can be improved upon by the addition of genetic data and to explore the mechanisms of the persisting risk in former smokers.

Methods: We analysed transcriptomic data from accessible airway tissues of 487 subjects, including healthy volunteers and clinic patients of different smoking statuses. We developed a computational model to assess smoking-associated gene expression changes and their reversibility after smoking is stopped, comparing healthy subjects to clinic patients with and without lung cancer.

Results: We find persistent smoking-associated immune alterations to be a hallmark of the clinic patients. Integrating previous GWAS data using a transcriptional network approach, we demonstrate that the same immune- and interferon-related pathways are strongly enriched for genes linked to known genetic risk factors, demonstrating a causal relationship between immune alteration and lung cancer risk. Finally, we used accessible airway transcriptomic data to derive a non-invasive lung cancer risk classifier.

Conclusions: Our results provide initial evidence for germline-mediated personalized smoke injury response and risk in the general population, with potential implications for managing long-term lung cancer incidence and mortality.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13073-024-01317-4.


Fig. S1 Smoke injury reversibility analysis. (a) The slope coefficients associated with the three smoking status variables included in the Bayesian model (CS: current smoker status, FSS: former smoker status, FS: former smoker's time since quit). (b) Description of the model selection procedure used to assign each gene to a reversibility class: the table shows all possible combinations of inclusion/exclusion of the three smoking status variables. (c) In blue, yellow and red, schematic of a gene with altered expression in current compared to never smokers, and the three possible trajectories after smoking cessation, corresponding to the RR, SR and IR reversibility classes; in green, schematic of a gene with no expression difference in current versus never smokers, but altered expression in former smokers, corresponding to the CA class. US: not affected; RR: rapidly reversible (blue); SR: slowly reversible (yellow); IR: irreversible (red); CA: cessation-associated (green).
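The candidate models described in panel (b) can be enumerated mechanically. The following is a minimal sketch, assuming only that each of the three variables (CS, FSS, FS) is either included in or excluded from a model; the function name `candidate_models` is hypothetical, and the mapping from each combination to a reversibility class is given in the panel (b) table and is not reproduced here.

```python
from itertools import product

# The three smoking-status variables named in panel (a).
VARIABLES = ("CS", "FSS", "FS")


def candidate_models(variables=VARIABLES):
    """Return every inclusion/exclusion combination of the variables.

    Each model is a tuple of the variables it includes; the empty tuple
    is the model with no smoking-status variable at all.
    """
    models = []
    for mask in product((False, True), repeat=len(variables)):
        models.append(tuple(v for v, keep in zip(variables, mask) if keep))
    return models


models = candidate_models()
for included in models:
    print(included if included else "(no smoking variable)")
```

With three binary inclusion choices there are 2^3 = 8 candidate models per gene.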

Fig. S2 Principal component analysis on the genes belonging to different reversibility classes. RR: rapidly reversible genes; SR: slowly reversible genes; IR: irreversible genes. Each small dot is a patient, and colors indicate the smoking status of the patient. Large dots represent the mean of all patients for each smoking class. (a) Nasal samples from healthy volunteers, using the reversibility classes from the Bayesian model on the healthy volunteer group. Since only 2 genes were classified as IR, PCA was performed jointly for SR and IR genes. (b) Nasal samples from clinic subjects (cancer + benign), using the reversibility classes from the Bayesian model on the clinic group.

Fig. S7 :Fig. S10 :
Fig. S7: Clinic and population scores stratified by different clinical variables. Distribution of the clinic (top rows) and population (bottom rows) risk scores in subjects depending on (a) the type of cancer, (b) the stage of the cancer, (c) the COPD status. The color of each dot indicates the status of the individual subject, namely healthy volunteer (green), clinic benign (orange) or clinic cancer (purple).

Fig. S11: Genes robustly contributing to the population and clinic risk scores. The weights of the genes selected in more than 80% of cross-validations in the population (purple dot) and clinic (light green dot) classifiers. The plot presents the mean weight of each gene over all cross-validations, and the error bars represent the standard deviation; the annotation track on the right shows the reversibility classes of the genes in the clinic groups (left column) and in the healthy volunteers (right column). Grey: not affected by smoke; blue: rapidly reversible; yellow: slowly reversible; red: irreversible; green: cessation-associated. When a gene is not robustly contributing in both models (i.e. is selected in less than 80% of the cross-validation experiments), we present its weight in one model only.
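The selection criterion in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the input layout (one dictionary of gene weights per cross-validation run), the function name `robust_genes`, the gene names, and the choice to count a weight of 0 for runs in which a gene was not selected are all hypothetical.

```python
from statistics import mean, stdev


def robust_genes(run_weights, min_fraction=0.8):
    """Summarise genes selected in more than `min_fraction` of runs.

    run_weights: list of {gene: weight} dicts, one per cross-validation
    run (at least two runs); a gene missing from a dict was not selected
    in that run.
    Returns {gene: (selection_fraction, mean_weight, sd_weight)}.
    """
    n_runs = len(run_weights)
    genes = set().union(*run_weights)
    summary = {}
    for gene in sorted(genes):
        fraction = sum(gene in run for run in run_weights) / n_runs
        if fraction > min_fraction:
            # Assumption: an unselected gene contributes a weight of 0.
            weights = [run.get(gene, 0.0) for run in run_weights]
            summary[gene] = (fraction, mean(weights), stdev(weights))
    return summary


# Hypothetical weights from five cross-validation runs.
runs = [
    {"GENE_A": 1.0, "GENE_B": -0.5},
    {"GENE_A": 1.2, "GENE_B": -0.4, "GENE_C": 0.1},
    {"GENE_A": 0.8, "GENE_B": -0.6},
    {"GENE_A": 1.0, "GENE_B": -0.5, "GENE_C": 0.2},
    {"GENE_A": 1.0},
]
summary = robust_genes(runs)
print(summary)
```

In this toy input only GENE_A passes, because GENE_B is selected in exactly 80% of runs and the threshold is strict.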

Fig. S12 Pathway metascore depending on the smoking status. (a) Differences in pathway metascore response to smoke in clinic patients (solid line, circles) or in healthy volunteers (dotted line, triangles) for four GO terms involved in smoke injury response in nasal epithelium; (b) Differences in pathway metascores for immune-related gene sets in healthy volunteers (left panels) and in the clinic group (right panels). Former smokers with time since quit > 1 year were divided into those who quit <= 10 years and those who quit > 10 years before sample collection. Gene set metascores were calculated by averaging the nasal expression of genes belonging to each GO term. Red: current smokers, blue: ex-smokers between 1 and 12 months after smoking cessation, green: ex-smokers between 1 and 10 years after smoking cessation, purple: ex-smokers over 10 years after smoking cessation, orange:
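The metascore defined in this caption is a per-sample average of expression over the genes of one GO term. A minimal sketch follows, assuming expression is stored as a nested dictionary of already normalised values; the function name `geneset_metascore`, the sample and gene names, and the decision to skip genes that are absent from a sample's profile are hypothetical choices made for illustration.

```python
from statistics import mean


def geneset_metascore(expression, geneset):
    """Average expression of the gene set's genes in each sample.

    expression: {sample: {gene: normalised expression value}}
    geneset: iterable of gene names belonging to one GO term
    Returns {sample: metascore}; samples with none of the genes are omitted.
    """
    scores = {}
    for sample, profile in expression.items():
        # Assumption: genes not measured in this sample are skipped.
        values = [profile[g] for g in geneset if g in profile]
        if values:
            scores[sample] = mean(values)
    return scores


# Hypothetical expression values for two samples.
expression = {
    "sample_1": {"g1": 2.0, "g2": 4.0, "g3": 100.0},
    "sample_2": {"g1": 1.0, "g2": 1.0, "g3": 50.0},
}
scores = geneset_metascore(expression, ["g1", "g2"])
print(scores)
```

Samples could then be grouped by smoking status and the metascores compared across groups, as in the panels described above.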

Fig. S14: Combined environmental and genetic effect on gene expression in nasal tissues. Plots are shown for 9/10 genes with a significant interaction effect between smoking status and the genotype of the patient at the lead eQTL position (one gene is shown in the main text). We present the expression level of each gene separately for never, former and current smokers. Samples are further stratified by the genotype of the subject at the corresponding lead eQTL locus (pink: homozygous reference; green: heterozygous; blue: homozygous alternative). P-values and SNP positions are given in Supp. Table 6.

Fig. S15: (Equivalent to Fig. 6b) The activity level of the 4 TFs that regulate a high number of risk and GWAS genes, but this time calculated for the bronchial samples only, on a gene network inferred from the bronchial samples. As in the nasal samples, we found no differences between clinic patients with and without cancer. We did not collect bronchial tissue from healthy volunteers for ethical reasons. Orange: patients with cancer; purple: patients without cancer. P-values indicated at the top of each panel were calculated using a two-sample t-test.
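For readers unfamiliar with the test named in this caption, the statistic can be sketched as below. The caption does not say whether the pooled-variance (Student) or unequal-variance (Welch) form was used; this sketch assumes the pooled form, and the function name `pooled_t_statistic` and the activity values are hypothetical. Converting the statistic to a p-value requires the t distribution, which is not in the Python standard library and is left out.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic with pooled variance (Student's form).

    Returns (t, degrees_of_freedom). Each group needs at least two values.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled estimate of the common variance across both groups.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical TF activity values for the two patient groups.
cancer = [1.0, 2.0, 3.0]
no_cancer = [2.0, 3.0, 4.0]
t_stat, dof = pooled_t_statistic(cancer, no_cancer)
print(round(t_stat, 4), dof)
```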

Fig. S16 Exploratory analysis. (a) PCA computed on all genes for every sample before (left) and after (right) VST-normalization. Each dot represents a sample, colored according to the two experimental batches. Batch 2 is in blue; batch 1 was further stratified into 2 sets depending on the sequencing coverage (red: high coverage, green: low coverage). The shape of each dot depends on the tissue of origin of the sample (triangle: nasal sample, circle: bronchial sample). (b) Strength and significance of association between experimental batch and clinical covariates; for each pair of covariates, Cramer's V value (light red for values close to 0, dark red for values close to 1) and chi-square test p-value are reported (*: P <= 0.05, **: P <= 0.01, ***: P <= 0.001). (c) Contribution of different clinical variables to the total explained variance in gene expression, calculated using a random model on nasal samples.
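The association measure in panel (b) can be illustrated with a short sketch. It computes the Pearson chi-square statistic and the uncorrected Cramer's V, V = sqrt(chi2 / (n * (min(rows, columns) - 1))), from a contingency table of counts. Whether the authors applied a bias or continuity correction is not stated, so none is assumed here; the function name `cramers_v` and the example counts are hypothetical, and the p-value computation is omitted.

```python
from math import sqrt


def cramers_v(table):
    """Return (chi2, V) for a contingency table given as a list of rows.

    Assumes every row and every column contains at least one observation
    and that the table has at least two rows and two columns.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two covariates.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return chi2, sqrt(chi2 / (n * k))


# Hypothetical counts: rows are batches, columns are levels of a covariate.
confounded = [[10, 0], [0, 10]]   # covariate fully determined by batch
balanced = [[5, 5], [5, 5]]       # covariate independent of batch
print(cramers_v(confounded))
print(cramers_v(balanced))
```

A value of V near 1 flags a covariate that is nearly confounded with batch, while a value near 0 indicates that the covariate is balanced across batches.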